Kevin Williams Business Analytics Final Project¶

  • Real Estate Data
  • Concordia University - Wisconsin
  • Professor John Fields

Project Information¶

  • Data was Obtained from Kaggle.com and represents actual rental property data scraped from Craigslist.
  • This project will use the OSEMN framework that stands for:
    • Obtain
    • Scrub
    • Explore
    • Model
    • Interpret
  • Our goal of this project is to perform experimental analysis on rental listings on Craigslist in the United States as a whole. Ultimately, we want to know what factors/variables impact the rent per month which leads into our research question.
  • For the feasibility of this project, I will remove a big chunk of the data set below - this will allow VS code to run smoother.

Loading Neccessary Packages¶

In [ ]:
pip install numpy scipy pandas matplotlib.pyplot plotly==5.24.1 statsmodels mlxtend
In [ ]:
pip install --upgrade plotly jupyterlab
In [3]:
import numpy as np 
import scipy as sci
import pandas as pd
import matplotlib.pyplot as plt 
import plotly.express as px 
import statsmodels as stats

More packages will be added later into the project

Part One: Obtain¶

In [4]:
DF1 = pd.read_csv(r'c:\Users\Kevin\Desktop\Bus_Data_Anys_prog_class\Real_Estate Data.csv')

Part Two: Scrub¶

In [5]:
DF1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384977 entries, 0 to 384976
Data columns (total 21 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       384977 non-null  int64  
 1   region                   384977 non-null  object 
 2   region_url               384977 non-null  object 
 3   price                    384977 non-null  int64  
 4   type                     384977 non-null  object 
 5   sqfeet                   384977 non-null  int64  
 6   beds                     384977 non-null  int64  
 7   baths                    384977 non-null  float64
 8   cats_allowed             384977 non-null  int64  
 9   dogs_allowed             384977 non-null  int64  
 10  smoking_allowed          384977 non-null  int64  
 11  wheelchair_access        384977 non-null  int64  
 12  electric_vehicle_charge  384977 non-null  int64  
 13  comes_furnished          384977 non-null  int64  
 14  laundry_options          305951 non-null  object 
 15  parking_options          244290 non-null  object 
 16  image_url                384977 non-null  object 
 17  description              384975 non-null  object 
 18  lat                      383059 non-null  float64
 19  long                     383059 non-null  float64
 20  state                    384977 non-null  object 
dtypes: float64(3), int64(10), object(8)
memory usage: 61.7+ MB
  • Dataset contains 21 columns.
  • Dataset contains 384,977 rows of data.

Checking for Missing Values¶

In [6]:
DF1.isna().sum()
Out[6]:
id                              0
region                          0
region_url                      0
price                           0
type                            0
sqfeet                          0
beds                            0
baths                           0
cats_allowed                    0
dogs_allowed                    0
smoking_allowed                 0
wheelchair_access               0
electric_vehicle_charge         0
comes_furnished                 0
laundry_options             79026
parking_options            140687
image_url                       0
description                     2
lat                          1918
long                         1918
state                           0
dtype: int64
  • Latitude and Longitude are missing 1,918 records.
    • Because this is such a low number, compared to the amount of data, it would be simpler to omit the rows from the dataset.
  • Parking options have 140,687 missing records; because this is much higher than anticipated, this column will be removed from the dataset. Imputing method on this column will create inaccuracy.
  • Laundry column will also be removed from dataset.

Selecting Relevant Variables¶

In [7]:
DF2 = DF1.iloc[:50000, 3:21]
In [8]:
DF2.drop('laundry_options', axis= 'columns', inplace= True)
DF2.drop('parking_options', axis= 'columns', inplace= True)
DF2.drop('description', axis= 'columns', inplace= True)
DF2.drop('image_url', axis= 'columns', inplace= True)

Removing missing data from lat and long columns¶

In [9]:
DF2 = DF2.dropna()
In [10]:
DF2.isna().sum()
Out[10]:
price                      0
type                       0
sqfeet                     0
beds                       0
baths                      0
cats_allowed               0
dogs_allowed               0
smoking_allowed            0
wheelchair_access          0
electric_vehicle_charge    0
comes_furnished            0
lat                        0
long                       0
state                      0
dtype: int64

Renaming Price Variable to "monthly_rent" and sqfeet variable to "squarefeet"¶

In [11]:
DF2.rename(columns= {'price' : 'monthly_rent', 'sqfeet': 'squarefeet'}, inplace= True)

Filtering data¶

In [12]:
# Filtering dataset

DF3 = DF2[(DF2['monthly_rent'] <= 12000) & (DF2['monthly_rent'] >= 500) & 
          (DF2['squarefeet'] >= 300) & (DF2['beds'] <= 5) & (DF2['beds'] >= 1) & (DF2['squarefeet']<= 5000) & (DF2['baths'] >= 1)]

Part Three: Explore¶

In [13]:
DF3.describe()
Out[13]:
monthly_rent squarefeet beds baths cats_allowed dogs_allowed smoking_allowed wheelchair_access electric_vehicle_charge comes_furnished lat long
count 47498.000000 47498.000000 47498.000000 47498.000000 47498.000000 47498.000000 47498.000000 47498.000000 47498.000000 47498.000000 47498.00000 47498.000000
mean 2330.226178 1267.278012 2.244705 1.798813 0.661502 0.648659 0.583267 0.131985 0.047286 0.076214 37.36345 -100.144068
std 779.363888 553.623447 0.925673 0.647688 0.473204 0.477394 0.493023 0.338477 0.212253 0.265343 5.77784 21.861932
min 1700.000000 300.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -43.53330 -163.894000
25% 1850.000000 916.000000 2.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 33.93770 -119.062000
50% 2100.000000 1134.000000 2.000000 2.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 38.11310 -104.595500
75% 2500.000000 1453.000000 3.000000 2.000000 1.000000 1.000000 1.000000 0.000000 0.000000 0.000000 40.82220 -77.401525
max 12000.000000 5000.000000 5.000000 7.500000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 64.99370 172.633000

Takeaways from descriptive statistics¶

  • The average monthly rent is $2,330.00. The median is $2,100; because the mean is higher than the median, this would mean the data is skewed to the right. This is entails that more data lies on the left side of the distribution.
  • The standard deviation of monthly rent is $779.00.
    • To determine if this is high - we will compare it to the mean using coefficient of variation below. Coefficient of variation is a way to measure how spread out values are in a dataset relative to the mean. A value greater than 1 is often considered high.
  • The average square-footage is 1,267. The median is 1,134 square-feet. Again, because the mean is higher than the median, this would tell us the data is skewed to the right.
    • We will also use the coefficient of variation to determine if the standard deviation of 553 square-feet is high.

Coefficient of variation¶

image.png

In [14]:
# CV for Monthly Rent

STDRENT = 779.00
MEANRENT = 2330.00

def CV_Rent():
    CVRent = round((STDRENT/MEANRENT) * 100)
    print('The coefficent of variation for monthly rent is:',CVRent,'%')

CV_Rent()
The coefficent of variation for monthly rent is: 33 %
In [15]:
#CV for Square Footage
STDFOOTAGE = 553.00
MEANFOOTAGE = 1267.00

def CV_Footage():
    CVFootage = round((STDFOOTAGE/MEANFOOTAGE) * 100)
    print('The coefficent of variation for monthly rent is:',CVFootage,'%')


CV_Footage()
The coefficent of variation for monthly rent is: 44 %

Data Visualization¶

In [ ]:
 
In [16]:
import plotly.io as pio
In [17]:
pio.renderers.default = 'notebook'
In [ ]:
pip install seaborn
In [19]:
import seaborn as sns

Histograms¶

In [20]:
px.histogram(DF3, x= 'monthly_rent',
             title= 'Histogram of Monthly Rent',
             color_discrete_sequence= ['green'])
In [21]:
sci.stats.skew(DF3['monthly_rent'])
Out[21]:
np.float64(3.763049120952408)
In [22]:
px.histogram(DF3, x= 'squarefeet',
             title= 'Histogram of Square Footage',
             color_discrete_sequence= ['green'])
In [23]:
sci.stats.skew(DF3['squarefeet'])
Out[23]:
np.float64(1.8898599855496985)

Map¶

In [24]:
def DataMap():
    MapData = px.scatter_mapbox(DF3,
                  lon= DF3['long'],
                  lat= DF3['lat'],
                  zoom= 4,
                  color= DF3['type'],
                  width= 1200,
                  height= 900,
                  size= DF3['monthly_rent'],
                  title= 'Map of Housing Data - This will help us visually view the location of our housing data')

    MapData.update_layout(mapbox_style = 'open-street-map')
    MapData.show()

DataMap()

Bar Charts¶

In [25]:
sns.set_style('dark')
In [26]:
sns.barplot(data= DF3, x= 'monthly_rent', y= 'type',  estimator= 'mean', errorbar= None, edgecolor = 'blue', 
            color= 'orange')
plt.title('Average Monthly Rent by Living Type')
Out[26]:
Text(0.5, 1.0, 'Average Monthly Rent by Living Type')
No description has been provided for this image
In [27]:
TypeMean = DF3.groupby('type').monthly_rent.mean()
TypeMean = pd.DataFrame(TypeMean)
TypeMean.sort_values(by= ['monthly_rent'], ascending= True).head(10)
Out[27]:
monthly_rent
type
apartment 2274.021998
townhouse 2331.984159
duplex 2340.254470
loft 2380.051095
cottage/cabin 2413.612903
in-law 2414.642857
flat 2448.105882
condo 2519.454401
house 2525.232280
manufactured 2694.555556
In [28]:
plt.figure(figsize= (10,10))
sns.barplot(data= DF3, x= 'monthly_rent', y= 'state',  estimator= 'mean', errorbar= None, color= 'orange',
            edgecolor = 'blue')
plt.title('Average Monthly Rent by State')
Out[28]:
Text(0.5, 1.0, 'Average Monthly Rent by State')
No description has been provided for this image
In [29]:
StateMean = DF3.groupby('state').monthly_rent.mean()
In [30]:
StateMean = pd.DataFrame(StateMean)
StateMean.sort_values(by= ['monthly_rent'], ascending= True)
Out[30]:
monthly_rent
state
sd 1921.454545
md 1968.771429
va 1985.210623
ut 1989.693333
nd 2005.985401
nh 2010.847887
mo 2012.500000
ak 2023.294393
in 2030.371429
nc 2034.089583
me 2086.276190
de 2097.258216
ky 2101.745098
ga 2107.144444
id 2108.207171
az 2125.946541
ia 2126.296552
ri 2133.069444
mi 2134.658444
nv 2144.086207
mt 2144.212329
ne 2146.837838
tx 2146.914316
ar 2151.018519
ms 2154.000000
wi 2157.928952
ct 2164.863636
or 2180.410256
oh 2183.477477
ks 2193.142857
pa 2197.734093
co 2208.196516
tn 2209.748677
mn 2217.412760
la 2218.109929
vt 2221.559055
fl 2246.187007
nj 2293.441423
ok 2296.074074
wy 2320.000000
al 2324.387755
dc 2324.754717
wa 2325.148827
ma 2332.834593
sc 2341.305600
nm 2390.942857
il 2433.630731
wv 2434.400000
ny 2493.355991
ca 2517.291421
hi 2573.647325

Scatter Plot¶

In [31]:
px.scatter(DF3, x= 'squarefeet', y= 'monthly_rent', color = 'type',
           title= 'Relationship between Monthly Rent & Sq Footage')

The Scatter plot appears to be noisy. We can use the "corr()" function below to see the correlation numerically.

Correlation¶

In [32]:
Correlation = DF3['squarefeet'].corr(DF3['monthly_rent'])
print(f' The correlation between Square feet and monthly rent is {Correlation}')
 The correlation between Square feet and monthly rent is 0.2521794550715152

Advance Statistical Analysis¶

We are also interested to see if there is a statistically significant difference between the living types and their respected average monthly price. To do this we will use Anova

In [33]:
ANOVA = DF3.groupby('type').mean(numeric_only=True)
ANOVA
Out[33]:
monthly_rent squarefeet beds baths cats_allowed dogs_allowed smoking_allowed wheelchair_access electric_vehicle_charge comes_furnished lat long
type
apartment 2274.021998 1082.162109 1.961174 1.653637 0.741835 0.708180 0.625153 0.160093 0.056968 0.055595 37.459169 -98.990983
condo 2519.454401 1235.983033 2.066278 1.839873 0.362672 0.380700 0.458643 0.098091 0.027041 0.216331 35.009757 -102.241810
cottage/cabin 2413.612903 1195.298387 2.016129 1.532258 0.354839 0.387097 0.403226 0.064516 0.016129 0.387097 34.321426 -113.338343
duplex 2340.254470 1392.162311 2.617607 1.773728 0.449794 0.478680 0.412655 0.114168 0.013755 0.141678 37.978051 -109.431757
flat 2448.105882 1214.664706 2.147059 1.500000 0.711765 0.652941 0.588235 0.382353 0.370588 0.135294 36.620680 -94.980686
house 2525.232280 2003.198595 3.390168 2.277669 0.406945 0.472103 0.485109 0.039407 0.012095 0.119651 37.419490 -104.168295
in-law 2414.642857 725.392857 1.357143 1.035714 0.071429 0.035714 0.178571 0.000000 0.035714 0.321429 35.161196 -121.439161
loft 2380.051095 1252.153285 1.788321 1.755474 0.722628 0.708029 0.489051 0.211679 0.087591 0.065693 38.415204 -86.580630
manufactured 2694.555556 1383.777778 2.622222 1.822222 0.422222 0.466667 0.511111 0.044444 0.000000 0.177778 38.010289 -111.687129
townhouse 2331.984159 1541.903331 2.682778 2.342405 0.650690 0.647847 0.473193 0.046304 0.025995 0.076767 37.638376 -99.251916
In [34]:
px.box(DF3, x='monthly_rent', y='type')
In [35]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels import *
In [36]:
#H0: There is no difference between the groups and thier monthly rent.
#HA: There is a difference between the groups and thier monthly rent.

new = ols('monthly_rent ~ type', data= DF3).fit()
an = sm.stats.anova_lm(new, typ = 1)
print(an)
               df        sum_sq       mean_sq          F         PR(>F)
type          9.0  4.778726e+08  5.309695e+07  88.871114  6.070463e-165
Residual  47488.0  2.837219e+10  5.974602e+05        NaN            NaN

Are ANOVA test returned a p-value less than .05, meaning there is a statistically significant difference between the groups. To find out where the difference is, we will continue the analysis.

In [37]:
pair_t = new.t_test_pairwise('type')
pair_t.result_frame
Out[37]:
coef std err t P>|t| Conf. Int. Low Conf. Int. Upp. pvalue-hs reject-hs
condo-apartment 245.432403 18.282265 13.424617 5.154720e-41 209.598908 281.265897 2.268077e-39 True
cottage/cabin-apartment 139.590905 69.539064 2.007374 4.471552e-02 3.293370 275.888440 7.346317e-01 False
duplex-apartment 66.232472 28.970160 2.286231 2.224514e-02 9.450554 123.014391 5.240192e-01 False
flat-apartment 174.083884 59.430008 2.929225 3.399695e-03 57.600240 290.567529 1.123627e-01 False
house-apartment 251.210282 9.754872 25.752290 3.033636e-145 232.090597 270.329966 1.365136e-143 True
in-law-apartment 140.620859 146.134587 0.962270 3.359191e-01 -145.804969 427.046686 9.982979e-01 False
loft-apartment 106.029097 66.170064 1.602373 1.090799e-01 -23.665152 235.723345 9.442854e-01 False
manufactured-apartment 420.533557 115.301114 3.647264 2.653313e-04 194.541766 646.525349 1.029592e-02 True
townhouse-apartment 57.962161 16.128452 3.593783 3.262407e-04 26.350169 89.574153 1.232261e-02 True
cottage/cabin-condo -105.841498 71.659013 -1.477016 1.396780e-01 -246.294162 34.611166 9.619229e-01 False
duplex-condo -179.199930 33.743190 -5.310699 1.096956e-07 -245.337053 -113.062808 4.387817e-06 True
flat-condo -71.348518 61.897161 -1.152695 2.490416e-01 -192.667817 49.970780 9.942312e-01 False
house-condo 5.777879 19.861777 0.290904 7.711257e-01 -33.151480 44.707238 9.998080e-01 False
in-law-condo -104.811544 147.155191 -0.712252 4.763124e-01 -393.237769 183.614681 9.997245e-01 False
loft-condo -139.403306 68.394518 -2.038223 4.153313e-02 -273.457514 -5.349097 7.198999e-01 False
manufactured-condo 175.101155 116.591936 1.501829 1.331479e-01 -53.420666 403.622975 9.619229e-01 False
townhouse-condo -187.470242 23.652899 -7.925889 2.314236e-15 -233.830253 -141.110230 9.719790e-14 True
duplex-cottage/cabin -73.358433 75.100220 -0.976807 3.286696e-01 -220.555911 73.839046 9.982979e-01 False
flat-cottage/cabin 34.492979 91.283617 0.377866 7.055318e-01 -144.424183 213.410141 9.998080e-01 False
house-cottage/cabin 111.619377 69.970925 1.595225 1.106686e-01 -25.524613 248.763366 9.442854e-01 False
in-law-cottage/cabin 1.029954 161.728446 0.006368 9.949188e-01 -315.960056 318.019963 9.998080e-01 False
loft-cottage/cabin -33.561808 95.808404 -0.350301 7.261141e-01 -221.347616 154.223999 9.998080e-01 False
manufactured-cottage/cabin 280.942652 134.518102 2.088512 3.675700e-02 17.285297 544.600008 6.868072e-01 False
townhouse-cottage/cabin -81.628744 71.140002 -1.147438 2.512064e-01 -221.064139 57.806651 9.942312e-01 False
flat-duplex 107.851412 65.850494 1.637822 1.014654e-01 -21.216473 236.919297 9.380694e-01 False
house-duplex 184.977809 29.991985 6.167575 6.990296e-10 126.193101 243.762518 2.866021e-08 True
in-law-duplex 74.388387 148.861275 0.499716 6.172773e-01 -217.381787 366.158560 9.998016e-01 False
loft-duplex 39.796624 71.991937 0.552793 5.804079e-01 -101.308576 180.901825 9.998016e-01 False
manufactured-duplex 354.301085 118.737983 2.983890 2.847537e-03 121.572984 587.029187 9.756407e-02 False
townhouse-duplex -8.270311 32.626502 -0.253484 7.998950e-01 -72.218709 55.678087 9.998080e-01 False
house-flat 77.126398 59.934755 1.286839 1.981566e-01 -40.346557 194.599352 9.849443e-01 False
in-law-flat -33.463025 157.646233 -0.212267 8.319000e-01 -342.451840 275.525789 9.998080e-01 False
loft-flat -68.054787 88.743984 -0.766866 4.431647e-01 -241.994233 105.884659 9.997245e-01 False
manufactured-flat 246.449673 129.581496 1.901889 5.719166e-02 -7.531866 500.431213 8.077548e-01 False
townhouse-flat -116.121723 61.295549 -1.894456 5.817057e-02 -236.261853 4.018407 8.077548e-01 False
in-law-house -110.589423 146.340584 -0.755699 4.498335e-01 -397.419007 176.240161 9.997245e-01 False
loft-house -145.181185 66.623767 -2.179120 2.932765e-02 -275.764698 -14.597672 6.142330e-01 False
manufactured-house 169.323276 115.562087 1.465215 1.428690e-01 -57.180025 395.826577 9.619229e-01 False
townhouse-house -193.248121 17.899048 -10.796558 3.842698e-27 -228.330505 -158.165737 1.652360e-25 True
loft-in-law -34.591762 160.308722 -0.215782 8.291585e-01 -348.799092 279.615567 9.998080e-01 False
manufactured-in-law 279.912698 186.050412 1.504499 1.324595e-01 -84.748703 644.574100 9.619229e-01 False
townhouse-in-law -82.658698 146.903151 -0.562675 5.736590e-01 -370.590923 205.273527 9.998016e-01 False
manufactured-loft 314.504461 132.807815 2.368117 1.788286e-02 54.199291 574.809630 4.585580e-01 False
townhouse-loft -48.066936 67.850540 -0.708424 4.786856e-01 -181.054941 84.921069 9.997245e-01 False
townhouse-manufactured -362.571396 116.273666 -3.118259 1.820316e-03 -590.469402 -134.673391 6.519101e-02 False

Results from ANOVA¶

  • If we examine the far right column where the values say "True", these are values that represent a significant difference.
    • Condo-apartment
    • House-apartment
    • Manufactured-apartment
    • Townhouse-apartment
    • Duplex-condo
    • Townhouse-condo
    • House-duplex
    • Townhouse-house

Part Four: Model¶

In [38]:
DF3.info()
<class 'pandas.core.frame.DataFrame'>
Index: 47498 entries, 229 to 49999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   monthly_rent             47498 non-null  int64  
 1   type                     47498 non-null  object 
 2   squarefeet               47498 non-null  int64  
 3   beds                     47498 non-null  int64  
 4   baths                    47498 non-null  float64
 5   cats_allowed             47498 non-null  int64  
 6   dogs_allowed             47498 non-null  int64  
 7   smoking_allowed          47498 non-null  int64  
 8   wheelchair_access        47498 non-null  int64  
 9   electric_vehicle_charge  47498 non-null  int64  
 10  comes_furnished          47498 non-null  int64  
 11  lat                      47498 non-null  float64
 12  long                     47498 non-null  float64
 13  state                    47498 non-null  object 
dtypes: float64(3), int64(9), object(2)
memory usage: 5.4+ MB

Multiple Regression Analysis¶

In [39]:
# Creating Dependent and Independent Variable Dataframes 

X = DF3.drop(['monthly_rent','lat', 'long', 'state'], axis= 'columns')

y = DF3['monthly_rent'] # We are looking to predict monthly rent.
In [40]:
# Creating dummy variables for type column

X = pd.get_dummies(X,  columns=['type'], drop_first= False)
print(X)
       squarefeet  beds  baths  cats_allowed  dogs_allowed  smoking_allowed  \
229          2200     2    2.0             0             1                1   
230          1984     2    2.0             0             0                0   
231          1984     2    2.0             0             0                0   
232          3450     3    3.5             0             0                0   
233          5000     5    4.5             1             1                1   
...           ...   ...    ...           ...           ...              ...   
49995        1250     2    2.0             1             1                0   
49996        1243     2    2.0             0             0                0   
49997        1250     2    2.0             1             1                0   
49998        1250     2    2.0             1             1                0   
49999        1190     2    2.0             1             1                1   

       wheelchair_access  electric_vehicle_charge  comes_furnished  \
229                    0                        0                0   
230                    0                        0                1   
231                    0                        0                1   
232                    0                        0                1   
233                    0                        0                0   
...                  ...                      ...              ...   
49995                  0                        0                0   
49996                  0                        0                0   
49997                  0                        0                0   
49998                  0                        0                0   
49999                  0                        0                0   

       type_apartment  type_condo  type_cottage/cabin  type_duplex  type_flat  \
229             False       False               False         True      False   
230             False        True               False        False      False   
231             False        True               False        False      False   
232              True       False               False        False      False   
233             False       False               False         True      False   
...               ...         ...                 ...          ...        ...   
49995            True       False               False        False      False   
49996           False        True               False        False      False   
49997            True       False               False        False      False   
49998            True       False               False        False      False   
49999            True       False               False        False      False   

       type_house  type_in-law  type_loft  type_manufactured  type_townhouse  
229         False        False      False              False           False  
230         False        False      False              False           False  
231         False        False      False              False           False  
232         False        False      False              False           False  
233         False        False      False              False           False  
...           ...          ...        ...                ...             ...  
49995       False        False      False              False           False  
49996       False        False      False              False           False  
49997       False        False      False              False           False  
49998       False        False      False              False           False  
49999       False        False      False              False           False  

[47498 rows x 19 columns]
In [41]:
# Converting string data to 0 and 1
X = X.iloc[:, 0:18].astype(int)
In [42]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 43)
In [43]:
from sklearn.linear_model import LinearRegression

OLS = LinearRegression()

OLS.fit(X_train, y_train)
Out[43]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [44]:
# Display the intercept and coefficients of the OLS model

print('Intercept is: ' + str(OLS.intercept_))
print()
print('The set of coefficients are: ' + str(OLS.coef_))
print()
print('The R-Squared value is: ' + str(OLS.score(X_train,y_train)))
Intercept is: 1603.423654424329

The set of coefficients are: [ 3.84598452e-01 -5.46718152e+01  1.42053228e+02 -3.04299590e+00
 -3.58969173e+01 -4.52971521e+01  1.41273363e+02  2.90259897e+02
  3.67705568e+02  1.27464318e+02  2.47651832e+02  1.27567977e+02
  7.33833465e+01  1.43217913e+02  2.15902467e+01  4.05557505e+02
  1.05804562e+02  4.61022101e+02]

The R-Squared value is: 0.11231675179290102
In [45]:
# Predicting with OLS

y_pred = OLS.predict(X_test)
performance = pd.DataFrame({'Predictions': y_pred, 'Actual Values': y_test})
performance['error'] = performance['Actual Values'] - performance['Predictions']
performance.head()
Out[45]:
Predictions Actual Values error
31123 2282.162678 1960 -322.162678
17539 2601.203160 2300 -301.203160
35431 2401.435551 1893 -508.435551
34746 2284.710567 1899 -385.710567
24613 2143.921643 2100 -43.921643
In [46]:
# Preparing data for plotting

performance.reset_index(drop = True, inplace= True)
performance.reset_index(inplace= True)
performance.head()
Out[46]:
index Predictions Actual Values error
0 0 2282.162678 1960 -322.162678
1 1 2601.203160 2300 -301.203160
2 2 2401.435551 1893 -508.435551
3 3 2284.710567 1899 -385.710567
4 4 2143.921643 2100 -43.921643
In [47]:
# Plot residual

fig = plt.figure(figsize=(15,5))
sns.barplot(data= performance[:50], x= 'index', y= 'error', color= 'blue')
plt.xlabel('Observations')
plt.ylabel('Residuals')
plt.show()
No description has been provided for this image

Interpreting Model Visualization¶

  • We can see that our model both over predicted and under-predicted. For example - only examining the first 50 observations, observations with a positive residual value means the model under-predicted. and a negative residual value means the model over-predicted. A model witht he error terms closer to 0 is a better model.
In [48]:
X_train = sm.add_constant(X_train)
X_train.head()

Model = sm.OLS(y_train, X_train).fit()
Model.summary()
Out[48]:
OLS Regression Results
Dep. Variable: monthly_rent R-squared: 0.112
Model: OLS Adj. R-squared: 0.112
Method: Least Squares F-statistic: 267.0
Date: Sat, 28 Sep 2024 Prob (F-statistic): 0.00
Time: 14:57:26 Log-Likelihood: -3.0473e+05
No. Observations: 37998 AIC: 6.095e+05
Df Residuals: 37979 BIC: 6.097e+05
Df Model: 18
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 1603.4237 23.796 67.381 0.000 1556.782 1650.065
squarefeet 0.3846 0.012 32.203 0.000 0.361 0.408
beds -54.6718 6.803 -8.037 0.000 -68.006 -41.338
baths 142.0532 8.810 16.124 0.000 124.785 159.321
cats_allowed -3.0430 15.923 -0.191 0.848 -34.253 28.167
dogs_allowed -35.8969 15.452 -2.323 0.020 -66.183 -5.611
smoking_allowed -45.2972 8.160 -5.551 0.000 -61.291 -29.303
wheelchair_access 141.2734 12.264 11.519 0.000 117.235 165.312
electric_vehicle_charge 290.2599 18.783 15.453 0.000 253.444 327.075
comes_furnished 367.7056 14.711 24.995 0.000 338.871 396.540
type_apartment 127.4643 17.722 7.192 0.000 92.729 162.200
type_condo 247.6518 25.513 9.707 0.000 197.646 297.657
type_cottage/cabin 127.5680 77.156 1.653 0.098 -23.659 278.795
type_duplex 73.3833 34.957 2.099 0.036 4.867 141.900
type_flat 143.2179 64.683 2.214 0.027 16.437 269.999
type_house 21.5902 19.722 1.095 0.274 -17.066 60.246
type_in-law 405.5575 154.655 2.622 0.009 102.429 708.686
type_loft 105.8046 71.316 1.484 0.138 -33.976 245.585
type_manufactured 461.0221 123.785 3.724 0.000 218.400 703.644
Omnibus: 29757.720 Durbin-Watson: 2.002
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1009487.779
Skew: 3.509 Prob(JB): 0.00
Kurtosis: 27.256 Cond. No. 5.67e+04


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.67e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Part Five: Interpret¶

Someone interested in finding a place to rent via Craigslist could use this report to make strategic decisions on the most affordable locations to rent including based on type of living situation. For example - we examine that South Dakota is the most affordable at an average monthly rent of $1,921.45 and Hawaii being the most expensive at $2,573.64.One can also view the interactive map in which was created using the latitude and longitude data to give a precise location of the properties.

Furthermore, one can examine the average rent based on housing type while viewing if the means are statistically different by examining the ANOVA section of the report.

This report can also be used for someone interested in either real estate investing or just renting out their property. Thos individuals can examine not only the exploration section, but also the model section to see what features have an impact on monthly rent. However, this isnt the best model for two reasons:

  1. The R-square is low at .11 meaning that only 11% of the variation in monthly rent can be explained by the independent variables in the model. This means there are other variables out there thats not in the model that may explain more of the variation of monthly rent.
  2. The residual plot shows shows variation in the model, in otherwords the model under predicted and over predicted the monthly rent price.